Consumer Complaint Database

An overview of consumer complaint behavior and company responses

By Luis Alarcon

Disclaimer

I endeavour in this project out of my own volition. It is the first time that I come in contact with this data. I carry no bias before carrying this Exploratory Data Analysis, and the primary goal is to further my knowledge in the methods as well as an understanding of different areas through data analysis.

Introduction

Dataset General Information

The full name of the data set is “Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011”. It is own and updated by CMS.

The last update was on May 29, 2014, and the metadata was updated on August 30, 2017. For access to the data set as well as further information, please go to this link.

Dataset Contents

The Dataset has 12 columns. These are the names and the description of them:

This information it is copied straight from the pdf attachment found in this link.
  1. DRG Definition
  2. The code and description identifying the MS-DRG. MS-DRGs are a 
    classification system that groups similar clinical conditions (diagnoses) 
    and the procedures furnished by the hospital during the stay.
  3. Provider Id
  4. The CMS Certification Number (CCN) assigned to the Medicare certified 
    hospital facility.
  5. Provider Name
  6. The name of the provider.
  7. Provider Street Address
  8. The provider’s street address.
  9. Provider City
  10. The city where the provider is located.
  11. Provider State
  12. The state where the provider is located.
  13. Provider Zip Code
  14. The provider’s zip code.
  15. Provider HRR
  16. The Hospital Referral Region (HRR) where the provider is located.
  17. Total Discharges
  18. The number of discharges billed by the provider for inpatient 
    hospital services.
  19. Average Covered Charges
  20. The provider's average charge for services covered by Medicare 
    for all discharges in the MS-DRG. These will vary from hospital to 
    hospital because of differences in hospital charge structures.
  21. Average Total Payments
  22. The average total payments to all providers for the MS-DRG including 
    the MS-DRG amount, teaching, disproportionate share, capital, and outlier
    payments for all cases. Also included in average total payments are 
    co-payment and deductible amounts that the patient is responsible for and 
    any additional payments by third parties for coordination of benefits.
  23. Average Medicare Payments
  24. The average amount that Medicare pays to the provider for Medicare's share 
    of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount, 
    teaching, disproportionate share, capital, and outlier payments for all 
    cases. Medicare payments DO NOT include beneficiary co-payments and 
    deductible amounts norany additional payments from third parties for 
    coordination of benefits. Note: In general, Medicare FFS claims with 
    dates-of-service or dates-of-discharge on or after April 1, 2013, incurred 
    a 2 percent reduction in Medicare payment. This is in response to mandatory 
    across-the-board reductions in Federal spending, also known as 
    sequestration. For additional information, <a href= "http://www.cms.gov/Outreach-and-Education/Outreach/FFSProvPartProg/Downloads/
    2013-03-08-standalone.pdf">visit</a>. 

The initial process of refining the dataset was done in a python script named “u_project_EDA-CMS_python3.ipynb”, the code is presented in “u_project_EDA-CMS_python3.html”; for further details examine the HTML file.

Exploratory Data Analysis (EDA)

EDA: Initial Data Exploration

Data Summary

Firstly, we load the packages that we are going to use in this project.

## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Secondly, we load the dataset from its original location to the data frame df_cms.

Before we continue, we check if there are any null values within the data frame df_cms.

There are no null values in the data frame. Afterwards, we check the number of entries present in this dataset.

## [1] 163065     16

The number of entries is 163065. The number of entries is significant. Therefore, We proceed to make a quick summary of the data.

##                                                                   DRG.Definition  
##  194 - SIMPLE PNEUMONIA & PLEURISY W CC                                  :  3023  
##  690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC                         :  2989  
##  292 - HEART FAILURE & SHOCK W CC                                        :  2953  
##  392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC            :  2950  
##  641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC:  2899  
##  871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                :  2812  
##  (Other)                                                                 :145439  
##   Provider.Id                       Provider.Name   
##  Min.   : 10001   GOOD SAMARITAN HOSPITAL  :   633  
##  1st Qu.:110092   ST JOSEPH MEDICAL CENTER :   427  
##  Median :250007   MERCY MEDICAL CENTER     :   357  
##  Mean   :255570   MERCY HOSPITAL           :   347  
##  3rd Qu.:380075   ST JOSEPH HOSPITAL       :   343  
##  Max.   :670077   ST FRANCIS MEDICAL CENTER:   277  
##                   (Other)                  :160681  
##              Provider.Street.Address      Provider.City   
##  100 MEDICAL CENTER DRIVE:   183     CHICAGO     :  1505  
##  800 WASHINGTON STREET   :   166     BALTIMORE   :  1059  
##  1 MEDICAL CENTER DRIVE  :   142     HOUSTON     :   950  
##  100 HOSPITAL DRIVE      :   106     PHILADELPHIA:   898  
##  100 MICHIGAN ST NE      :   100     BROOKLYN    :   877  
##  1000 BLYTHE BLVD        :   100     SPRINGFIELD :   807  
##  (Other)                 :162268     (Other)     :156969  
##  Provider.State   Provider.Zip.Code Hospital.Referral.Region.Description
##  CA     : 13064   Min.   : 1040     CA - Los Angeles :  3653            
##  TX     : 11864   1st Qu.:27261     MA - Boston      :  2910            
##  FL     : 11155   Median :44309     GA - Atlanta     :  2630            
##  NY     :  9178   Mean   :47938     TX - Houston     :  2577            
##  IL     :  7909   3rd Qu.:72901     PA - Philadelphia:  2554            
##  PA     :  7804   Max.   :99835     TX - Dallas      :  2427            
##  (Other):102091                     (Other)          :146314            
##  Total.Discharges  Average.Covered.Charges Average.Total.Payments
##  Min.   :  11.00   Min.   :  2459          Min.   :  2673        
##  1st Qu.:  17.00   1st Qu.: 15947          1st Qu.:  5234        
##  Median :  27.00   Median : 25246          Median :  7214        
##  Mean   :  42.78   Mean   : 36134          Mean   :  9707        
##  3rd Qu.:  49.00   3rd Qu.: 43233          3rd Qu.: 11286        
##  Max.   :3383.00   Max.   :929119          Max.   :156158        
##                                                                  
##  Average.Medicare.Payments   Census.Region  
##  Min.   :  1149            MIDWEST  :39087  
##  1st Qu.:  4192            NORTHEAST:30230  
##  Median :  6158            SOUTH    :67038  
##  Mean   :  8494            WEST     :26710  
##  3rd Qu.: 10057                             
##  Max.   :154621                             
##                                             
##            Census.Region.Division    Federal.Region 
##  D5_SOUTCH_ATLANTIC   :34118      REGION_IV :37453  
##  D3_EAST_NORTH_CENTRAL:27434      REGION_V  :29686  
##  D2_MID-ATLANTIC      :21808      REGION_VI :20395  
##  D7_WEST_SOUTH_CENTRAL:19478      REGION_III:17911  
##  D9_PACIFIC           :17814      REGION_IX :17559  
##  D6_EAST_SOUTH_CENTRAL:13442      REGION_II :14004  
##  (Other)              :28971      (Other)   :26057  
##  Economic.Analysis.Region
##  SOUTHEAST  :48468       
##  GREAT LAKES:27434       
##  MIDEAST    :25994       
##  FAR WEST   :19016       
##  SOUTHWEST  :18152       
##  PLAINS     :11653       
##  (Other)    :12348

Due to the vast array of data and groups found in this particular data frame. We will begin our analysis per regions. We will start with “Census Region” since it encompasses all states within the four central regions of the USA, the divisions within regions, the Federal bank that is in charge of the state as well as the Regions established by the Bureau of Economic Analysis.

Univariate Plots Section

This graph contains the number of entry per each of the groups mentioned above.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Univariate Analysis Section

We will start the initial analysis by obtaining the frequency for each one of the variables of the columns:

  1. “Census.Region”

  2. “Census Region Division”

  3. “Federal Region”

  4. “Bureau of Economic Analysis Region”

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
print(count(df_cms, vars = "Census.Region"))
##   Census.Region  freq
## 1       MIDWEST 39087
## 2     NORTHEAST 30230
## 3         SOUTH 67038
## 4          WEST 26710
print(count(df_cms, vars = "Census.Region.Division"))
##   Census.Region.Division  freq
## 1          D1_NEW_ENGLAD  8422
## 2        D2_MID-ATLANTIC 21808
## 3  D3_EAST_NORTH_CENTRAL 27434
## 4  D4_WEST_NORTH_CENTRAL 11653
## 5     D5_SOUTCH_ATLANTIC 34118
## 6  D6_EAST_SOUTH_CENTRAL 13442
## 7  D7_WEST_SOUTH_CENTRAL 19478
## 8            D8_MOUNTAIN  8896
## 9             D9_PACIFIC 17814
print(count(df_cms, vars = "Federal.Region"))
##    Federal.Region  freq
## 1        REGION_I  8422
## 2       REGION_II 14004
## 3      REGION_III 17911
## 4       REGION_IV 37453
## 5       REGION_IX 17559
## 6        REGION_V 29686
## 7       REGION_VI 20395
## 8      REGION_VII  8493
## 9     REGION_VIII  4283
## 10       REGION_X  4859
print(count(df_cms, vars = "Economic.Analysis.Region"))
##   Economic.Analysis.Region  freq
## 1                 FAR WEST 19016
## 2              GREAT LAKES 27434
## 3                  MIDEAST 25994
## 4              NEW ENGLAND  8422
## 5                   PLAINS 11653
## 6           ROCKY MOUNTAIN  3926
## 7                SOUTHEAST 48468
## 8                SOUTHWEST 18152

In the Census Region, the region South has the highest account of entries with 67038 and the lowest in the West with 26710. In the Divisions in the Census Region, the Division 5 South Atlantic has the hights entries with 34116, and the lowest in the Division 1 New England.

When considering Regions under the different Federal Banks, Region IV is the one with the highest entry and the region with the lowest entries is Region_XIII with 4283. Lastly, The Rocky Mountain Region of the Bureau of Economic Analysis is the region with the lowest entries in 3926. On the other hand, Southeast is the region with the highest entries with 48468.

Before we continue, we will get the summary of the Total Discharges, Average Covered Charges, Average Total Payments and Average Medicare Payments per Census Region.

by(df_cms$Total.Discharges, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   17.00   28.00   43.87   50.00 1487.00 
## -------------------------------------------------------- 
## df_cms$Census.Region: NORTHEAST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   18.00   29.00   45.83   53.00 3383.00 
## -------------------------------------------------------- 
## df_cms$Census.Region: SOUTH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   17.00   27.00   43.81   50.00 1344.00 
## -------------------------------------------------------- 
## df_cms$Census.Region: WEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.00   23.00   35.13   39.00 1696.00
by(df_cms$Average.Covered.Charges, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3611   14792   21413   29395   34654  353774 
## -------------------------------------------------------- 
## df_cms$Census.Region: NORTHEAST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3481   14964   25870   37123   45428  613927 
## -------------------------------------------------------- 
## df_cms$Census.Region: SOUTH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2459   15416   23890   33361   40136  480540 
## -------------------------------------------------------- 
## df_cms$Census.Region: WEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4202   23310   37111   51836   61668  929119
by(df_cms$Average.Total.Payments, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2673    5062    6897    9228   10837  131187 
## -------------------------------------------------------- 
## df_cms$Census.Region: NORTHEAST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2765    5610    7824   10515   12138  140255 
## -------------------------------------------------------- 
## df_cms$Census.Region: SOUTH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2683    4878    6686    8941   10460   99307 
## -------------------------------------------------------- 
## df_cms$Census.Region: WEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2995    6134    8525   11419   13295  156158
by(df_cms$Average.Medicare.Payments, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1327    4016    5850    7991    9613  130467 
## -------------------------------------------------------- 
## df_cms$Census.Region: NORTHEAST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1680    4597    6818    9366   11009  133177 
## -------------------------------------------------------- 
## df_cms$Census.Region: SOUTH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1149    3849    5640    7746    9303   95701 
## -------------------------------------------------------- 
## df_cms$Census.Region: WEST
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1604    5034    7429   10125   11900  154621

EDA: Furthe Data Exploration

Bivariate Plots Section

We will start by analysing the following group of variables together:

  1. Average Covered Charges vs Average Medicare Payments: This comparison was chosen because Average Covered Charges are the average charge for services covered by Medicare for all discharges in the-DRG and Average Medicare Payments are the average amount that Medicare pays providers for Medicare’s share of the MS-DRG.
  2. Total Discharges vs Average Total Payments: This comparison was chosen because the total discharges account for the number of discharges billed by the provider for inpatient hospital services and the average total payments account for the total the average total payments to all providers for the MS-DRG; including partial coverage by the patient.
  3. Average Total Payments vs Average Medicare Payments.

Average Covered Charges vs Average Medicare Payments

CC-vs-MP:Plots

The first graph that is going to show the whole data from df_cms. The whole data is graphed in this way in order to acquire the whole pattern of the data between the two variables.

## Warning: Removed 896 rows containing missing values (geom_point).
## Warning: Removed 66 rows containing missing values (geom_point).
## Warning: Removed 262 rows containing missing values (geom_point).
## Warning: Removed 189 rows containing missing values (geom_point).
## Warning: Removed 379 rows containing missing values (geom_point).

As we can see from the graph West and Midwest regions, have a similar scatter plot, flatter and more left-leaning. However, Midwest has a more linear shape than the West region.

Northeast and South regions have similar scatterplots with widening x and y profile than other regions. However, the South region have a straight line. It will be interesting to know why it was developed.

CC-vs-MP:Analysis

We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.

res <- cor.test(df_cms$Average.Covered.Charges, 
                df_cms$Average.Medicare.Payments, 
                    method = "pearson")

res
## 
##  Pearson's product-moment correlation
## 
## data:  df_cms$Average.Covered.Charges and df_cms$Average.Medicare.Payments
## t = 485.66, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7669354 0.7709033
## sample estimates:
##       cor 
## 0.7689268

The test shows a cor value of 0.767. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Average Covered Charges and Average Medicare payments.

For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data we will explore other analysis between these two variables

Total Discharges vs Average Total Payments

TD-vs-TP:Plots

This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.

This graph shows that the majority of total discharges are fewer than 250 with an increase average total payments are mostly under 25000$.

The next graph will plot the same graph as above but with limits of x-value 500 and y-value 50000, as well as four separate graphs with the data frame gather selectively from the regions of the USA.

## Warning: Removed 990 rows containing missing values (geom_point).
## Warning: Removed 155 rows containing missing values (geom_point).
## Warning: Removed 279 rows containing missing values (geom_point).
## Warning: Removed 253 rows containing missing values (geom_point).
## Warning: Removed 303 rows containing missing values (geom_point).

All graphs share a similar pattern than the one shared from the pattern mention in the general and global graph. There are two distinct protrusions of data reflecting an increase of total discharges with approximately an average total payments of about 10000; with a constant increase of discharges, similar to a horizontal line. The pattern is repeated about 15000 Average Total Payments; it is less apparent in the West Region.

TP-vs-MP:Analysis

We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.

res <- cor.test(df_cms$Total.Discharges, df_cms$Average.Total.Payments, 
                    method = "pearson")

res
## 
##  Pearson's product-moment correlation
## 
## data:  df_cms$Total.Discharges and df_cms$Average.Total.Payments
## t = -6.4699, df = 163060, p-value = 9.833e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02087213 -0.01116732
## sample estimates:
##        cor 
## -0.0160201

The test shows a cor value of -0.016. Values between 0 and -0.3 state that there is a weak negative linear relationship between Total Discharges and Average Total Payments.

It is unusual that this the correlation found between these variables; especially since these two variables involve the total of discharges and average total payments.

For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.

Average Total Payments vs Average Medicare Payments

TP-vs-MP:Plots

This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.

This graph shows a linear-like pattern. The majority of the entries are within the liner pattern. Also, the majority of entries are within x-value and y-value limits of 50000.

## Warning: Removed 779 rows containing missing values (geom_point).
## Warning: Removed 106 rows containing missing values (geom_point).
## Warning: Removed 243 rows containing missing values (geom_point).
## Warning: Removed 152 rows containing missing values (geom_point).
## Warning: Removed 278 rows containing missing values (geom_point).

All graphs show the same pattern. There are mild differences in the width. Further analysis could reveal more about this pattern.

TP-vs-MP:Analysis

We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.

res <- cor.test(df_cms$Average.Total.Payments, df_cms$Average.Medicare.Payments, 
                    method = "pearson")

res
## 
##  Pearson's product-moment correlation
## 
## data:  df_cms$Average.Total.Payments and df_cms$Average.Medicare.Payments
## t = 2746.3, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9892588 0.9894642
## sample estimates:
##      cor 
## 0.989362

The test shows a cor value of 0.989. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Total Payments and Average Medicare Payments.

For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.

Multivariate Data

Multivariate Data: Plot

Further plots

set.seed(2500)
ggplot(aes(x=Census.Region.Division, fill=Census.Region),data= df_cms)+
  geom_boxplot(aes(y= Total.Discharges))+
  coord_cartesian(ylim = c(0, 90))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Conclusion

The Dataset is one that shows a significant amount of data towards Medicare payment information in hospitals throughout the USA. It consists of 16 variables and 163065 entries. We explore the various relationships between total payments, Medicare payments, Total Discharges, and regions within the USA.

The dataset is vast and much more analysis could be performed, for example, the differences in payment from Medicare and Total payments made per procedure per hospitals in the same region as well as through states or regions in the USA.